Sources:
I felt compelled to take on the wine dataset. Having studied in Reims (the capital of Champagne), I had some (crazy) hope that the white wine dataset would be about champagne. Unfortunately, it was not. Since I wanted to make some pretty charts, I decided to combine the white wine and red wine datasets.
More information about the datasets can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
Let’s use R for our Exploratory Data Analysis and learn a bit more about this dataset and what makes a good wine!
First, let’s see what we have in the dataset:
## [1] 6497
## [1] 14
This dataset consists of 14 variables and almost 6,500 wine observations (1,599 red wines and 4,898 white wines). Let’s take a closer look at the variable names:
## 'data.frame': 6497 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ color : chr "white" "white" "white" "white" ...
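As a rough illustration of how a combined data frame like this one can be built, the sketch below stacks two tiny made-up data frames and tags each row with its color. The toy values and the file names mentioned in the comment are assumptions; only the `mixed_wines` name and the `color` column come from the output above.

```r
# Toy stand-ins for the two source files. In practice each frame would be read
# with read.csv() from files whose names are not given in this report
# (for example "wineQualityReds.csv", a hypothetical name).
red <- data.frame(X = 1:2, alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1),
                    quality = c(6L, 6L, 6L))

# Tag each row with its color before stacking, so the origin is not lost.
red$color <- "red"
white$color <- "white"
mixed_wines <- rbind(white, red)

dim(mixed_wines)  # number of rows, then number of columns
str(mixed_wines)  # variable names and types
```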
Let’s rename some variables (because yes, we are lazy):
## X fixed.a volatile.a citric.a
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.s chlorides free.sd total.sd
## Min. : 0.600 Min. :0.00900 Min. : 1.00 Min. : 6.0
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0
## Median : 3.000 Median :0.04700 Median : 29.00 Median :118.0
## Mean : 5.443 Mean :0.05603 Mean : 30.53 Mean :115.7
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0
## Max. :65.800 Max. :0.61100 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50
## Median :0.9949 Median :3.210 Median :0.5100 Median :10.30
## Mean :0.9947 Mean :3.219 Mean :0.5313 Mean :10.49
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30
## Max. :1.0390 Max. :4.010 Max. :2.0000 Max. :14.90
## quality color
## Min. :3.000 Length:6497
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.818
## 3rd Qu.:6.000
## Max. :9.000
Let’s drop the X variable, which is just the ID of the wine (and especially irrelevant since we bound both data frames together).
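The renaming and the removal of X could look something like the sketch below. The old and new column names are taken from the outputs above; the two-row data frame is a made-up stand-in, and the exact code used for this report may differ.

```r
# Toy frame with the original column names (values are illustrative only).
mixed_wines <- data.frame(
  X = 1:2,
  fixed.acidity = c(7.0, 6.3),
  volatile.acidity = c(0.27, 0.30),
  citric.acid = c(0.36, 0.34),
  residual.sugar = c(20.7, 1.6),
  free.sulfur.dioxide = c(45, 14),
  total.sulfur.dioxide = c(170, 132)
)

old_names <- c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "free.sulfur.dioxide", "total.sulfur.dioxide")
new_names <- c("fixed.a", "volatile.a", "citric.a",
               "residual.s", "free.sd", "total.sd")

# Rename by matching on the old names, so column order does not matter.
names(mixed_wines)[match(old_names, names(mixed_wines))] <- new_names

# Drop the row ID, which carries no information once the two sets are stacked.
mixed_wines$X <- NULL
names(mixed_wines)
```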
Let’s see if we have any missing values in our dataframe!
## fixed.a volatile.a citric.a residual.s chlorides free.sd
## 0 0 0 0 0 0
## total.sd density pH sulphates alcohol quality
## 0 0 0 0 0 0
## color
## 0
No data point missing!
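A per-column count like the one above can be obtained in one line. The sketch below is one common way to do it, shown on a toy frame that deliberately contains one missing value so the count is visible; it is not necessarily the call used for this report.

```r
# Toy frame with a single missing alcohol value (illustrative data only).
mixed_wines <- data.frame(alcohol = c(8.8, 9.5, NA),
                          quality = c(6L, 6L, 5L))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
missing_per_column <- colSums(is.na(mixed_wines))
missing_per_column

# Quick yes/no check for the whole data frame.
anyNA(mixed_wines)
```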
Let’s dive deeper into the dataset!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.818 6.000 9.000
The quality is approximately normally distributed. It ranges from 3 to 9, with the median at 6 and the mean at 5.818.
Apparently, no wine is perfect!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
The alcohol distribution is right-skewed. Alcohol ranges from 8% to 14.9%, with the median at 10.30 and the mean at 10.49.
This makes sense, since wines under 8% are rare (they exist but are hard to make), as are wines over 16% (for tax reasons).
It would have been interesting to have the region of origin to check my stereotype: wines with less than 11% alcohol come from cool climates, and wines with more than 13% alcohol come from hot climates.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000
The sulphates distribution is right-skewed. Sulphates range from 0.22 to 2, with a median at 0.51 and a mean at 0.5313.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.110 3.210 3.219 3.320 4.010
The pH is approximately normally distributed, with a few outliers. It ranges from 2.72 to 4.01, with a median at 3.21 and a mean at 3.219.
Wine has quite a low pH compared to water (7.0), which should not be surprising.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390
The density is approximately normally distributed, with an extreme outlier at 1.039. It ranges from 0.9871 to 1.039, with a median at 0.9949 and a mean at 0.9947.
Very few wines seem to have a higher density than water (1.0).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 77.0 118.0 115.7 156.0 440.0
The total sulfur dioxide distribution appears bimodal, with modes around 20 and 120. Total sulfur dioxide ranges from 6.0 to 440.0, with a median at 118.0 and a mean at 115.7.
Interestingly, a total sulfur dioxide above 50 ppm (or mg / dm^3) affects the taste of the wine. Will it impact quality?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 29.00 30.53 41.00 289.00
The free sulfur dioxide distribution is right-skewed, with some extreme outliers. Free sulfur dioxide ranges from 1 to 289, with a median at 29 and a mean at 30.53.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
The chlorides distribution is right-skewed, with some extreme outliers. Chlorides range from 0.009 to 0.611, with a median at 0.047 and a mean at 0.05603.
A wine should not really be salty, so I wonder whether the concentration of chlorides will have an effect on the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800
The residual sugar distribution is right-skewed, with some extreme outliers. Residual sugar ranges from 0.6 to 65.8, with a median at 3 and a mean at 5.443.
I wonder if we can find a “sweet” spot between quality and residual sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2500 0.3100 0.3186 0.3900 1.6600
The citric acid is approximately normally distributed, with some extreme outliers. Citric acid ranges from 0 to 1.66, with a median at 0.31 and a mean at 0.3186.
Citric acid is apparently good in small quantities to add freshness to wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800
The volatile acidity distribution is right-skewed. Volatile acidity ranges from 0.08 to 1.58, with a median at 0.29 and a mean at 0.3397.
Too much volatile acidity can impact the taste of the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.400 7.000 7.215 7.700 15.900
The fixed acidity distribution is right-skewed. Fixed acidity ranges from 3.8 to 15.9, with a median at 7.0 and a mean at 7.215.
There are 6497 wines in the dataset with 13 features (fixed.a, volatile.a, citric.a, residual.s, chlorides, free.sd, total.sd, density, pH, sulphates, alcohol, quality, color).
Other observations:
The main features in the dataset are quality and alcohol. I would like to determine which features are best for predicting the quality of a wine. I believe that some other features might have an impact on the quality of the wine. I also wonder whether white and red wines have different quality profiles.
pH, volatile acidity, citric acid, residual sugar, chlorides and total sulfur dioxide. I think pH and residual sugar will have the most effect on wine quality (hint: I was wrong!).
I want to create a new variable, rating, that separates quality into 3 categories:
* Quality of 4 or under -> “Poor”
* Quality of 5 or 6 -> “Average”
* Quality of 7 or over -> “Good”
We will then have 14 variables in our dataset.
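One way to derive such a rating variable is with `cut()`, as in the sketch below. The break points follow the three categories described above; the sample quality scores are made up, and the original code may have built the variable differently (for example with `ifelse()`).

```r
# A few illustrative quality scores covering all three categories.
quality <- c(3L, 4L, 5L, 6L, 7L, 9L)

# Intervals are closed on the right by default:
# (-Inf, 4] -> Poor, (4, 6] -> Average, (6, Inf) -> Good.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("Poor", "Average", "Good"))

table(rating)
```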
There are fewer poor wines than good wines.
Quite a few (7) of the distributions are right-skewed:
* alcohol
* sulphates
* free sulfur dioxide
* chlorides
* residual sugar
* volatile acidity
* fixed acidity
Most (10) of the distributions have outliers:
* sulphates
* pH
* density
* total sulfur dioxide
* free sulfur dioxide
* chlorides
* residual sugar
* citric acid
* volatile acidity
* fixed acidity
One distribution is bimodal:
* total sulfur dioxide
I did not transform any of the distributions for the univariate analysis.
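The skewness calls in this section were made by eye, from the histograms and from the mean sitting above the median. A numeric check is possible too; the sketch below uses the standard moment-based sample skewness on simulated data. The `skewness()` helper is defined here for illustration and is not part of the original analysis.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
right_skewed <- rexp(1000)       # simulated, strongly right-skewed sample
symmetric <- c(-2, -1, 0, 1, 2)  # perfectly symmetric toy sample

skewness(right_skewed)
skewness(symmetric)
```

Applied to a column such as `mixed_wines$residual.s`, a clearly positive value would support the right-skewed label.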
This is a very interesting chart:
Relationships with quality that I want to explore:
* Quality is positively correlated with alcohol.
* Quality is slightly negatively correlated with density, volatile acidity and chlorides.
Additional relationships, not related to quality, that I want to explore:
* Alcohol is strongly negatively correlated with density.
* Density is strongly positively correlated with residual sugar and fixed acidity.
Other notable relationships:
* Free sulfur dioxide is positively correlated with total sulfur dioxide and residual sugar.
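The relationships listed above were read off a correlation chart. The underlying numbers can be obtained with `cor()` on the numeric columns, as sketched below on simulated data. The toy values are assumptions chosen only to give a negative alcohol/density relationship, and the plotting package used for the chart is not shown in this report.

```r
set.seed(42)
n <- 200

# Simulated stand-ins: density falls as alcohol rises, plus some noise.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, mean = 0, sd = 0.001)
toy <- data.frame(alcohol, density, color = "white")

# Keep only numeric columns, since cor() cannot handle the color column.
numeric_cols <- sapply(toy, is.numeric)
cor_matrix <- round(cor(toy[, numeric_cols]), 2)
cor_matrix
```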
Let’s look at these relationships in more detail.
## mixed_wines$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.625 10.150 10.215 11.000 12.600
## --------------------------------------------------------
## mixed_wines$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.00 10.18 10.90 13.50
## --------------------------------------------------------
## mixed_wines$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.300 9.600 9.838 10.300 14.900
## --------------------------------------------------------
## mixed_wines$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.50 10.59 11.40 14.00
## --------------------------------------------------------
## mixed_wines$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.62 11.40 11.39 12.30 14.20
## --------------------------------------------------------
## mixed_wines$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.68 12.60 14.00
## --------------------------------------------------------
## mixed_wines$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
##
## Pearson's product-moment correlation
##
## data: mixed_wines$alcohol and mixed_wines$quality
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4245892 0.4636261
## sample estimates:
## cor
## 0.4443185
The boxplot provides us with an interesting insight:
* Low quality wines generally have less alcohol than their better ranked counterparts.
This is confirmed by a Pearson’s correlation coefficient of 0.4443185, which shows a moderately strong positive correlation.
This confirms that alcohol has an important impact on the quality of the wine.
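The per-quality summaries and the correlation test shown above have the shape of `by()` and `cor.test()` output. The sketch below reproduces that pattern on simulated data; the numbers it prints will not match the report, and the simulated relationship between alcohol and quality is an assumption made only so the example runs on its own.

```r
set.seed(7)
n <- 300

# Simulated data: alcohol tends to rise with the quality score.
quality <- sample(3:9, n, replace = TRUE)
alcohol <- 9 + 0.4 * (quality - 3) + rnorm(n, mean = 0, sd = 1)
mixed_wines <- data.frame(alcohol, quality)

# Five-number summary (plus mean) of alcohol within each quality score.
by(mixed_wines$alcohol, mixed_wines$quality, summary)

# Pearson correlation with its confidence interval and p-value.
result <- cor.test(mixed_wines$alcohol, mixed_wines$quality)
result$estimate
result$conf.int
```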
This visualization shows even more clearly that good wines have a higher alcohol concentration than average and bad wines.
Not surprisingly, the wine’s color does not influence the quality of the wine.
It is interesting to notice that only white wines got a quality score of 9. However, that might be due to the difference in size between the 2 datasets.
We still need to verify whether good red wines share the same characteristics as good white wines.
##
## Pearson's product-moment correlation
##
## data: mixed_wines$density and mixed_wines$quality
## t = -25.89, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3277372 -0.2836508
## sample estimates:
## cor
## -0.3058579
The boxplot provides us with an interesting insight:
* Low quality wines generally have a higher density than their better ranked counterparts.
This is confirmed by a Pearson’s correlation coefficient of -0.3058579, which shows a moderately strong negative correlation.
This confirms that density has an important impact on the quality of the wine.
##
## Pearson's product-moment correlation
##
## data: mixed_wines$volatile.a and mixed_wines$quality
## t = -22.212, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2881545 -0.2429524
## sample estimates:
## cor
## -0.2656995
The boxplot provides us with an interesting insight:
* Low quality wines generally have a higher volatile acidity than their better ranked counterparts.
This is confirmed by a Pearson’s correlation coefficient of -0.2656995, which shows a slight negative correlation.
This confirms that volatile acidity has an important impact on the quality of the wine.
##
## Pearson's product-moment correlation
##
## data: mixed_wines$chlorides and mixed_wines$quality
## t = -16.508, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2238898 -0.1772134
## sample estimates:
## cor
## -0.2006655
The boxplot provides us with an interesting insight:
* Low quality wines generally have higher chlorides than their better ranked counterparts.
This is confirmed by a Pearson’s correlation coefficient of -0.2006655, which shows a slight negative correlation.
This confirms that chlorides have an important impact on the quality of the wine.
##
## Pearson's product-moment correlation
##
## data: mixed_wines$alcohol and mixed_wines$density
## t = -76.14, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6993829 -0.6736787
## sample estimates:
## cor
## -0.6867454
The scatter plot provides us with an interesting insight:
* The more alcohol, the lower the density.
This is confirmed by a Pearson’s correlation coefficient of -0.6867454, which shows a strong negative correlation.
This is interesting because both alcohol and density are correlated with the wine’s quality, but they are also heavily correlated with each other. It will probably not be possible to throw them together in a model to predict the wine’s quality.
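To put a number on how much alcohol and density overlap, one option is the shared variance (the squared correlation) and the variance inflation factor that would result if these were the only two predictors in a linear model. This is a worked calculation from the coefficient reported above, not something the original analysis computed.

```r
# Pearson correlation between alcohol and density, as reported above.
r <- -0.6867454

# Share of variance the two variables have in common.
shared_variance <- r^2

# With exactly two predictors, each one's variance inflation factor
# is 1 / (1 - r^2).
vif <- 1 / (1 - shared_variance)

round(shared_variance, 3)
round(vif, 2)
```

How large a value is too large depends on the modeling goal, so this is only one input into the decision to keep a single one of the two variables.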
##
## Pearson's product-moment correlation
##
## data: mixed_wines$residual.s and mixed_wines$density
## t = 53.423, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5353934 0.5691865
## sample estimates:
## cor
## 0.552517
The scatter plot provides us with interesting insights:
* The more residual sugar, the higher the density.
* There seem to be two different trends in the data.
This is confirmed by a Pearson’s correlation coefficient of 0.552517, which shows a strong positive correlation.
##
## Pearson's product-moment correlation
##
## data: mixed_wines$fixed.a and mixed_wines$density
## t = 41.626, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4394976 0.4778939
## sample estimates:
## cor
## 0.45891
The scatter plot provides us with an interesting insight:
* The more fixed acidity, the higher the density, even though a lot of the data is concentrated around a fixed acidity of 6 to 8.
This is confirmed by a Pearson’s correlation coefficient of 0.45891, which shows a moderately strong positive correlation.
The 2 correlation matrices show very interesting insights:
* Both wine colors have some similarities but also some differences.
* The quality of white and red wines is strongly positively correlated with alcohol and negatively correlated with volatile acidity.
* White wines’ quality is more strongly correlated with density and chlorides.
* Red wines’ quality is more strongly correlated with sulphates and citric acid.
That led me to believe that if we wanted to create a model to predict quality, we would need to separate red and white wines. Even though color is not correlated with quality, it influences how much the other variables matter for quality.
I think the triple relationship between quality, alcohol and density is interesting, since the three variables are correlated. We will need to drop either alcohol or density if we want to build a meaningful wine quality prediction model in the future.
Based on the correlation coefficients, I think it would make more sense to continue with the alcohol variable rather than the density variable.
Volatile acidity and chlorides both have an impact on the quality score.
It’s interesting to see that the color of the wine does not have an impact on the quality but has an indirect impact on which variables are correlated with the quality.
For red wines, sulphates and citric acid are more important, whereas for white wines density and chlorides matter more.
Maybe looking in more detail at the newly created rating variable would be interesting in the future.
Apart from the strong correlation between density and alcohol, density is correlated with several variables.
The strongest relationship I found is between density and alcohol, with a correlation of -0.6867454.
Volatile acidity, which was correlated with quality on the whole dataset in our previous exploration, actually reveals a stark difference between white and red wines.
Red wines mostly have a volatile acidity above 0.4, whereas white wines mostly have a volatile acidity under 0.4.
As previous results hinted, the variables most correlated with quality might differ between the whole dataset, red wines and white wines.
Let’s explore volatile acidity further by differentiating between white and red wines.
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 21 rows containing missing values (geom_point).
The scatter plots provide us with interesting insights:
* The volatile acidity of good white wines is between 0.10 and 0.60 and is combined with alcohol above 11.
* The volatile acidity of good red wines is between 0.20 and 0.75 and is combined with alcohol above 10.
* The volatile acidity of average white wines is lower than the volatile acidity of average red wines.
* Both white and red average wines have a lower alcohol level.
* Bad wines usually have a lower alcohol concentration.
Even though volatile acidity is correlated with quality for the total dataset, it seems to have a different effect depending on the wine color.
##
## Pearson's product-moment correlation
##
## data: subset(mixed_wines, color == "white")$volatile.a and subset(mixed_wines, color == "white")$quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
##
## Pearson's product-moment correlation
##
## data: subset(mixed_wines, color == "red")$volatile.a and subset(mixed_wines, color == "red")$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
We can see an important difference between white and red wines in the correlation between quality and volatile acidity. Red wines’ quality is moderately negatively correlated with volatile acidity (-0.3905578), whereas white wines’ quality is only slightly correlated with it (-0.194723).
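The two tests above were run on `subset()` slices of the data, one per color. The sketch below shows that pattern, plus a compact `split()`/`sapply()` variant that returns one coefficient per color. The data is simulated with a steeper slope for red wines, and quality is continuous here rather than an integer score; both are assumptions made so the example is self-contained.

```r
set.seed(11)
n_white <- 300
n_red <- 100

# Simulated data: quality drops faster with volatile acidity for red wines.
toy <- data.frame(
  color = rep(c("white", "red"), times = c(n_white, n_red)),
  volatile.a = runif(n_white + n_red, min = 0.1, max = 1.0)
)
slope <- ifelse(toy$color == "red", -3, -1)
toy$quality <- 6 + slope * toy$volatile.a + rnorm(nrow(toy), mean = 0, sd = 0.8)

# Same pattern as in the outputs above: test one color at a time.
red_test <- cor.test(subset(toy, color == "red")$volatile.a,
                     subset(toy, color == "red")$quality)
red_test$estimate

# Compact alternative: one Pearson coefficient per color.
cor_by_color <- sapply(split(toy, toy$color),
                       function(d) cor(d$volatile.a, d$quality))
cor_by_color
```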
Let’s have a look at the last variable which is moderately correlated with quality for the whole dataset.
## Warning: Removed 58 rows containing missing values (geom_point).
Once again we can see a clear separation between red and white wines. We will probably find different correlations depending on the wine color.
Anyway, the scatter plot gave us several interesting insights:
* Red wines have a higher chlorides concentration (0.07 to 0.11) than white wines (0.04 to 0.07).
* White wines have a larger range of alcohol (8.5 to 13.5) than red wines (9 to 13).
## Warning: Removed 17 rows containing missing values (geom_point).
## Warning: Removed 41 rows containing missing values (geom_point).
The scatter plots again give us a very interesting insight:
* Good white wines have lower chlorides (0.02 to 0.05) and higher alcohol (11 to 14) than good red wines (0.05 to 0.125 for chlorides and 11 to 13 for alcohol).
##
## Pearson's product-moment correlation
##
## data: subset(mixed_wines, color == "white")$chlorides and subset(mixed_wines, color == "white")$quality
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
##
## Pearson's product-moment correlation
##
## data: subset(mixed_wines, color == "red")$chlorides and subset(mixed_wines, color == "red")$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
Once again, the correlation with quality is very different for white and red wines.
White wines’ quality is moderately negatively correlated with chlorides (-0.2099344), whereas red wines’ quality is only slightly correlated with chlorides (-0.1289066).
It would be difficult to make a general rule for good quality wine without separating by color.
Even alcohol, which is strongly correlated with quality for both red and white wines, would be problematic, since we noticed that red wines usually have less alcohol than their white counterparts. Using a model on the whole dataset without separating by color would lead to red wines being misclassified because of their lower alcohol concentration.
According to our correlation matrix, white wines’ quality was most strongly correlated with:
* Alcohol
* Density (which we decided not to use because of its strong correlation with alcohol)
* Chlorides
* Volatile acidity
Unsurprisingly, these are the variables that we tested on the whole dataset.
Why unsurprising? Because there are more white wines than red wines in our dataset, which definitely skews our correlation calculations to “advantage” the features correlated with white wines’ quality.
Anyway, what makes a good white wine?
* Alcohol above 11.
* Volatile acidity between 0.10 and 0.60.
* Chlorides between 0.02 and 0.05.
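As a small illustration, the three white wine thresholds above can be turned into a logical filter. The three sample wines below are made up, and the rule is only a descriptive summary of the scatter plots, not a validated classifier.

```r
# Three made-up white wines (illustrative values only).
white <- data.frame(
  alcohol    = c(12.1, 9.5, 11.4),
  volatile.a = c(0.25, 0.30, 0.70),
  chlorides  = c(0.035, 0.045, 0.030)
)

# TRUE when a wine falls inside all three ranges listed above.
matches_profile <- with(white,
  alcohol > 11 &
    volatile.a >= 0.10 & volatile.a <= 0.60 &
    chlorides >= 0.02 & chlorides <= 0.05
)

matches_profile
```

Only the first wine passes: the second has too little alcohol and the third has too much volatile acidity.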
Let’s now investigate red wines further.
##
## Pearson's product-moment correlation
##
## data: subset(mixed_wines, color == "red")$sulphates and subset(mixed_wines, color == "red")$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
This scatter plot gives us very interesting insights:
* At low alcohol levels (under 11), good red wines have higher sulphates (0.6) than average red wines.
* At higher alcohol levels (above 11), good red wines can have lower sulphates (0.5), but they still have more than average.
Red wines’ quality is moderately positively correlated with sulphates (0.2513971).
For reference, white wines’ quality is only loosely correlated with sulphates (0.05367788).
##
## Pearson's product-moment correlation
##
## data: subset(mixed_wines, color == "red")$citric.a and subset(mixed_wines, color == "red")$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The scatter plot gave us a very interesting insight:
* Good quality red wines usually have higher citric acid (0.25) and alcohol (10) than average or bad red wines.
This is confirmed by the moderate positive correlation (0.2263725) between quality and citric acid.
So, what makes a good red wine?
* Alcohol above 10.
* Sulphates above 0.5.
* Citric acid above 0.25.
The created color variable seems to be one of the most important variables, given the difference between the 2 kinds of wine.
I understand better why Udacity decided to keep these 2 datasets separate for this project.
Alcohol strongly influences both white and red wine quality.
I really like this plot because it shows the distribution of quality and alcohol among red and white wines. It shows that the alcohol variable affects the quality of the wine for both white and red wines: better quality wines have higher alcohol. It also shows that there are more white wines than red wines and that the range of alcohol for white wines is usually larger.
I really like this chart because it is the first one that drew my attention to the strong difference between white and red wines. We can see that white wines have lower volatile acidity (between 0.1 and 0.5 g/dm^3), whereas red wines have higher volatile acidity (between 0.35 and 0.8 g/dm^3). I also think that the difference in alcohol range is clearer here than in the previous graph. It made me rethink my approach to the whole dataset: I decided not to consider all the wines together but rather to re-split them into white and red wines for subsequent analysis.
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 21 rows containing missing values (geom_point).
I really like this plot because it confirms what the previous chart hinted: good white wines and good red wines are different. On this plot we can see that the clusters for good and average wines are really different for white and red wines. The cluster of good white wines lies between 11 and 14% alcohol by volume and between 0.10 and 0.6 g/dm^3 of volatile acidity. In comparison, the cluster of good red wines lies between 10 and 13% alcohol by volume and between 0.25 and 0.75 g/dm^3.
First, I understand why the 2 datasets have been separated into 2 projects. It is more straightforward to analyze only one part of the dataset.
I struggled a bit at the beginning, mostly because I felt something wasn’t right with the results obtained with the full dataset. This became apparent when I started making multivariate plots. There is a (not so surprising) big difference between red and white wines.
I think it was really interesting to learn R independently after the lectures and to experiment with different visualizations.
Let’s keep in mind that these conclusions on what makes a good wine are based on a limited amount of data (6,497 wines in total) and that the quality of a wine may vary based on culture, geography and personal taste!
As such, for future exploration I think it would be interesting to have the geographical origin of the wine, more wine tasters and the country of origin of each taster. It would be fun to explore the dataset and uncover the different tastes of different regions. One might even see whether tasters from a specific geographical area actually have a preference for local wines.
Safe drinking, enjoy with moderation!